AITopics | mutation score

LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

Lops, Andrea, Narducci, Fedelucio, Ragone, Azzurra, Trizio, Michelantonio, Bartolini, Claudio

arXiv.org Artificial Intelligence, Nov-27-2025

Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Java unit tests generated by Large Language Models (LLMs). AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.20403

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Apulia > Bari (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(17 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Mutation Testing for Industrial Robotic Systems

Santos, Marcela Gonçalves dos, Hallé, Sylvain, Petrillo, Fábio

arXiv.org Artificial Intelligence, Nov-19-2025

Industrial robotic systems (IRS) are increasingly deployed in diverse environments, where failures can result in severe accidents and costly downtime. Ensuring the reliability of the software controlling these systems is therefore critical. Mutation testing, a technique widely used in software engineering, evaluates the effectiveness of test suites by introducing small faults, or mutants, into the code. However, traditional mutation operators are poorly suited to robotic programs, which involve message-based commands and interactions with the physical world. This paper explores the adaptation of mutation testing to IRS by defining domain-specific mutation operators that capture the semantics of robot actions and sensor readings. We propose a methodology for generating meaningful mutants at the level of high-level read and write operations, including movement, gripper actions, and sensor noise injection. An empirical study on a pick-and-place scenario demonstrates that our approach produces more informative mutants and reduces the number of invalid or equivalent cases compared to conventional operators. Results highlight the potential of mutation testing to enhance test suite quality and contribute to safer, more reliable industrial robotic systems.
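As a rough illustration of the mutation score that this abstract and the other entries on this page refer to, the sketch below injects three hand-written faults into a tiny function and counts how many of them a test suite detects. It is a generic example, not the domain-specific operators of the paper: the function `clamp`, the three mutants, and both test suites are assumptions made up for this illustration.

```python
from typing import Callable, List, Tuple

# One test case: ((x, lo, hi), expected result). All values are illustrative.
TestCase = Tuple[Tuple[int, int, int], int]


def clamp(x: int, lo: int, hi: int) -> int:
    """Program under test: restrict x to the range [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def mutant_wrong_bound(x: int, lo: int, hi: int) -> int:
    if x < lo:
        return hi  # injected fault: returns the wrong bound
    if x > hi:
        return hi
    return x


def mutant_off_by_one(x: int, lo: int, hi: int) -> int:
    if x < lo:
        return lo
    if x > hi + 1:  # injected fault: upper check shifted by one
        return hi
    return x


def mutant_equivalent(x: int, lo: int, hi: int) -> int:
    if x <= lo:  # "<" became "<=": same result as clamp whenever lo <= hi
        return lo
    if x > hi:
        return hi
    return x


MUTANTS: List[Callable[[int, int, int], int]] = [
    mutant_wrong_bound,
    mutant_off_by_one,
    mutant_equivalent,
]

WEAK_TESTS: List[TestCase] = [((5, 0, 10), 5), ((-3, 0, 10), 0), ((20, 0, 10), 10)]
# Adding a boundary case just above the upper limit exposes the off-by-one mutant.
STRONG_TESTS: List[TestCase] = WEAK_TESTS + [((11, 0, 10), 10)]


def is_killed(mutant: Callable[[int, int, int], int], tests: List[TestCase]) -> bool:
    """A mutant is killed when at least one test observes a wrong result."""
    return any(mutant(*args) != expected for args, expected in tests)


def mutation_score(mutants: List[Callable[[int, int, int], int]],
                   tests: List[TestCase]) -> float:
    """Fraction of mutants killed by the test suite."""
    killed = sum(is_killed(m, tests) for m in mutants)
    return killed / len(mutants)


if __name__ == "__main__":
    print("weak suite:  ", mutation_score(MUTANTS, WEAK_TESTS))
    print("strong suite:", mutation_score(MUTANTS, STRONG_TESTS))
```

The weak suite kills one mutant out of three and the strong suite kills two. The third mutant cannot be killed by any test with `lo <= hi`, which is why equivalent mutants, as mentioned in the abstract, distort the score unless they are excluded from the denominator.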

artificial intelligence, mutation operator, operator, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.4204/EPTCS.436.5

2511.14432

Country:

South America > Uruguay > Maldonado > Maldonado (0.04)
South America > Brazil > Paraná > Curitiba (0.04)
Oceania > Australia > Queensland > Brisbane (0.04)
(9 more...)

Genre: Research Report (0.82)

Industry: Government (0.98)

Technology: Information Technology > Artificial Intelligence > Robots > Robots in the Workplace (1.00)

Combining TSL and LLM to Automate REST API Testing: A Comparative Study

Barradas, Thiago, Paes, Aline, Neves, Vânia de Oliveira

arXiv.org Artificial Intelligence, Sep-9-2025

The effective execution of tests for REST APIs remains a considerable challenge for development teams, driven by the inherent complexity of distributed systems, the multitude of possible scenarios, and the limited time available for test design. Exhaustive testing of all input combinations is impractical, often resulting in undetected failures, high manual effort, and limited test coverage. To address these issues, we introduce RestTSLLM, an approach that uses Test Specification Language (TSL) in conjunction with Large Language Models (LLMs) to automate the generation of test cases for REST APIs. The approach targets two core challenges: the creation of test scenarios and the definition of appropriate input data. The proposed solution integrates prompt engineering techniques with an automated pipeline to evaluate various LLMs on their ability to generate tests from OpenAPI specifications. The evaluation focused on metrics such as success rate, test coverage, and mutation score, enabling a systematic comparison of model performance. The results indicate that the best-performing LLMs - Claude 3.5 Sonnet (Anthropic), Deepseek R1 (Deepseek), Qwen 2.5 32b (Alibaba), and Sabia 3 (Maritaca) - consistently produced robust and contextually coherent REST API tests. Among them, Claude 3.5 Sonnet outperformed all other models across every metric, emerging in this study as the most suitable model for this task. These findings highlight the potential of LLMs to automate the generation of tests based on API specifications.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.0554

Country:

South America > Brazil > Pernambuco > Recife (0.05)
North America > United States (0.04)
South America > Brazil > Rio de Janeiro > Niterói (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models

Walczak, Jakub, Tomalak, Piotr, Laskowski, Artur

arXiv.org Artificial Intelligence, Jul-22-2025

Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields markedly smaller gains. Notably, the chain-of-thought prompting strategy, applied even to 'reasoning' models, achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and a near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage while remaining among the top models in terms of compilation success rate.

Recent years have brought significant advancements in artificial intelligence (AI), particularly in the areas of performance and productivity enhancement. However, AI, and particularly large language models (LLMs), still suffers from several weaknesses. Among them, convincing but senseless content generation ('hallucination'), safety misalignment ('ethicality') [1], unfairness [2], and limited processing context are the most critical. In spite of these restrictions, and bearing in mind the limited and merely apparent creativity of LLMs [3], they have become versatile tools already widely used across a variety of domains (creative industries [4], entertainment, reporting, and software engineering [5] are just cases in point) for multiple tasks.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.14256

Country:

Europe > Poland > Łódź Province > Łódź (0.04)
Europe > Italy (0.04)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.49)

On Accelerating Deep Neural Network Mutation Analysis by Neuron and Mutant Clustering

Lyons, Lauren, Ghanbari, Ali

arXiv.org Artificial Intelligence, Jan-21-2025

Mutation analysis of deep neural networks (DNNs) is a promising method for effective evaluation of test data quality and model robustness, but it can be computationally expensive, especially for large models. To alleviate this, we present DEEPMAACC, a technique and a tool that speeds up DNN mutation analysis through neuron and mutant clustering. DEEPMAACC implements two methods: (1) neuron clustering to reduce the number of generated mutants and (2) mutant clustering to reduce the number of mutants to be tested by selecting representative mutants for testing. Both use hierarchical agglomerative clustering to group neurons and mutants with similar weights, with the goal of improving efficiency while maintaining mutation score. DEEPMAACC has been evaluated on 8 DNN models across 4 popular classification datasets and two DNN architectures. When compared to exhaustive, or vanilla, mutation analysis, the results provide empirical evidence that the neuron clustering approach, on average, accelerates mutation analysis by 69.77%, with an average -26.84% error in mutation score. Meanwhile, the mutant clustering approach, on average, accelerates mutation analysis by 35.31%, with an average 1.96% error in mutation score. Our results demonstrate that a trade-off can be made between mutation testing speed and mutation score error.
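The trade-off between speed and mutation score error described above can be sketched with a toy version of mutant clustering. This is not the DEEPMAACC implementation: the single-linkage rule, the distance threshold, the two-dimensional "weight" vectors, and the kill flags are all assumptions chosen for illustration. Only one representative per cluster is tested, and its outcome is extrapolated to the rest of the cluster.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence


def agglomerative(points: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Single-linkage agglomerative clustering over point indices.

    Repeatedly merges the two closest clusters until the closest pair
    is farther apart than `threshold`.
    """
    clusters: List[List[int]] = [[i] for i in range(len(points))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Single linkage: distance between the closest members of a and b.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected
    return clusters


def exact_score(killed: Sequence[bool]) -> float:
    """Mutation score when every mutant is actually tested."""
    return sum(killed) / len(killed)


def estimated_score(clusters: List[List[int]], killed: Sequence[bool]) -> float:
    """Test one representative per cluster and extrapolate to its members."""
    estimated_killed = sum(len(c) for c in clusters if killed[min(c)])
    return estimated_killed / sum(len(c) for c in clusters)


# Hypothetical mutants: a weight vector and the ground-truth kill result for each.
MUTANT_WEIGHTS = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
KILLED = [True, True, False, False, False, True]

if __name__ == "__main__":
    clusters = agglomerative(MUTANT_WEIGHTS, threshold=0.5)
    print("clusters:", clusters)
    print("mutants tested:", len(clusters), "of", len(KILLED))
    print("exact score:", exact_score(KILLED))
    print("estimated score:", estimated_score(clusters, KILLED))
```

With these made-up inputs, three of the six mutants need to be executed, and the estimate overshoots the exact score because one mutant in the first cluster behaves differently from its representative. A tighter threshold would reduce that error at the cost of testing more mutants.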

artificial intelligence, machine learning, neuron, (16 more...)

arXiv.org Artificial Intelligence

2501.12598

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Alabama > Lee County > Auburn (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

VALTEST: Automated Validation of Language Model Generated Test Cases

Taherkhani, Hamed, Hemmati, Hadi

arXiv.org Artificial Intelligence, Nov-12-2024

Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases. However, the validation of LLM-generated test cases remains a challenge, particularly when the ground truth is unavailable. This paper introduces VALTEST, a novel framework designed to automatically validate test cases generated by LLMs by leveraging token probabilities. We evaluate VALTEST using nine test suites generated from three datasets (HumanEval, MBPP, and LeetCode) across three LLMs (GPT-4o, GPT-3.5-turbo, and LLama3.1 8b). By extracting statistical features from token probabilities, we train a machine learning model to predict test case validity. VALTEST increases the validity rate of test cases by 6.2% to 24%, depending on the dataset and LLM. Our results suggest that token probabilities are reliable indicators for distinguishing between valid and invalid test cases, which provides a robust solution for improving the correctness of LLM-generated test cases in software testing. In addition, we found that replacing the invalid test cases identified by VALTEST, using Chain-of-Thought prompting, results in a more effective test suite while keeping the validity rates high.
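The idea of turning token probabilities into statistical features can be sketched as follows. The abstract does not list the features or the model that VALTEST uses, so everything here is an assumption: the feature set (mean, minimum, maximum, standard deviation), the probability values, and the threshold rule `looks_valid`, which merely stands in for the trained classifier.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def token_prob_features(probs: Sequence[float]) -> Dict[str, float]:
    """Summarise the per-token probabilities of one generated test case."""
    return {
        "mean": mean(probs),
        "min": min(probs),
        "max": max(probs),
        "std": pstdev(probs),
    }


def looks_valid(features: Dict[str, float],
                min_mean: float = 0.8,
                min_floor: float = 0.3) -> bool:
    """Hypothetical stand-in for a trained validity classifier.

    Flags a test as likely valid when the model was confident on average
    and never very unsure about any single token.
    """
    return features["mean"] >= min_mean and features["min"] >= min_floor


# Made-up token probabilities for two generated assertions.
CONFIDENT_TEST = [0.99, 0.97, 0.95, 0.98]
UNCERTAIN_TEST = [0.99, 0.42, 0.15, 0.90]

if __name__ == "__main__":
    for name, probs in [("confident", CONFIDENT_TEST), ("uncertain", UNCERTAIN_TEST)]:
        features = token_prob_features(probs)
        print(name, features, "->", looks_valid(features))
```

In a setting like the one the abstract describes, the feature dictionaries would be fed to a learned model rather than compared against fixed thresholds, and the tests predicted invalid would be filtered out or regenerated.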

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2411.08254

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Estonia > Harju County > Tallinn (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MILE: A Mutation Testing Framework of In-Context Learning Systems

Wei, Zeming, Zhang, Yihao, Sun, Meng

arXiv.org Artificial Intelligence, Sep-7-2024

In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. This mysterious ability of LLMs has attracted great research interest in understanding, formatting, and improving the in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity to the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.

dataset, demonstration, mutator, (12 more...)

arXiv.org Artificial Intelligence

2409.04831

Country:

Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
Asia > China > Beijing > Beijing (0.04)
Europe > France (0.04)

Genre: Research Report (1.00)

Industry: Social Sector (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Automated Test Case Generation Using Code Models and Domain Adaptation

Hashtroudi, Sepehr, Shin, Jiho, Hemmati, Hadi, Wang, Song

arXiv.org Artificial Intelligence, Aug-15-2023

State-of-the-art automated test generation techniques, such as search-based testing, are usually unaware of what a developer would create as a test case. Therefore, they typically create tests that are not human-readable and may not detect all the types of complex bugs that developer-written tests would. In this study, we leverage Transformer-based code models to generate unit tests that can complement search-based test generation. Specifically, we use CodeT5, i.e., a state-of-the-art large code model, and fine-tune it on the test generation downstream task. For our analysis, we use the Methods2test dataset for fine-tuning CodeT5 and Defects4j for project-level domain adaptation and evaluation. The main contribution of this study is proposing a fully automated testing framework that leverages developer-written tests and available code models to generate compilable, human-readable unit tests. Results show that our approach can generate new test cases that cover lines that were not covered by developer-written tests. Using domain adaptation, we can also increase line coverage of the model-generated unit tests by 49.9% and 54% in terms of mean and median (compared to the model without domain adaptation). We can also use our framework as a complementary solution alongside common search-based methods to increase the overall coverage with mean and median of 25.3% and 6.3%. It can also increase the mutation score of search-based methods by killing extra mutants (up to 64 new mutants were killed per project in our experiments).
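The notion of "killing extra mutants" when two test suites are combined comes down to set arithmetic over the mutants each suite kills. The sketch below uses made-up mutant identifiers and totals; it does not reproduce any figure from the study.

```python
from typing import Set, Tuple


def combined_gain(killed_by_base: Set[int],
                  killed_by_extra: Set[int],
                  total_mutants: int) -> Tuple[int, float, float]:
    """Return (newly killed mutants, base score, combined score)."""
    newly_killed = killed_by_extra - killed_by_base  # only the extra suite kills these
    base_score = len(killed_by_base) / total_mutants
    combined_score = len(killed_by_base | killed_by_extra) / total_mutants
    return len(newly_killed), base_score, combined_score


# Hypothetical project with 10 mutants, identified by integers.
SEARCH_BASED_KILLS = {1, 2, 3, 4, 5}
MODEL_GENERATED_KILLS = {4, 5, 6, 7}

if __name__ == "__main__":
    print(combined_gain(SEARCH_BASED_KILLS, MODEL_GENERATED_KILLS, total_mutants=10))
```

Mutants killed by both suites add nothing to the combined score, so the complementary value of a second suite depends on how little its kill set overlaps with the first.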

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2308.08033

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.75)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

DeepMetis: Augmenting a Deep Learning Test Set to Increase its Mutation Score

Riccio, Vincenzo, Humbatova, Nargiz, Jahangirova, Gunel, Tonella, Paolo

arXiv.org Artificial Intelligence, Sep-15-2021

Deep Learning (DL) components are routinely integrated into software systems that need to perform complex tasks such as image or natural language processing. The adequacy of the test data used to test such systems can be assessed by their ability to expose artificially injected faults (mutations) that simulate real DL faults. In this paper, we describe an approach to automatically generate new test inputs that can be used to augment the existing test set so that its capability to detect DL mutations increases. Our tool DeepMetis implements a search-based input generation strategy. To account for the non-determinism of the training and the mutation processes, our fitness function involves multiple instances of the DL model under test. Experimental results show that DeepMetis is effective at augmenting the given test set, increasing its capability to detect mutants by 63% on average. A leave-one-out experiment shows that the augmented test set is capable of exposing unseen mutants, which simulate the occurrence of yet undetected faults.

etis, mutation operator, operator, (15 more...)

arXiv.org Artificial Intelligence

2109.07514

Country:

North America > United States > New York > New York County > New York City (0.05)
Oceania > Australia (0.04)
North America > United States > Tennessee > Shelby County > Memphis (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Sentinel: A Hyper-Heuristic for the Generation of Mutant Reduction Strategies

Guizzo, Giovani, Sarro, Federica, Krinke, Jens, Vergilio, Silvia Regina

arXiv.org Artificial Intelligence, Mar-12-2021

Mutation testing is an effective approach to evaluate and strengthen software test suites, but its adoption is currently limited by the computational cost of executing the mutants. Several strategies have been proposed to reduce this cost (a.k.a. mutation cost reduction strategies); however, none of them has proven to be effective for all scenarios, since they often need an ad-hoc manual selection and configuration depending on the software under test (SUT). In this paper, we propose a novel multi-objective evolutionary hyper-heuristic approach, dubbed Sentinel, to automate the generation of optimal cost reduction strategies for every new SUT. We evaluate Sentinel by carrying out a thorough empirical study involving 40 releases of 10 open-source real-world software systems and both baseline and state-of-the-art strategies as a benchmark. We execute a total of 4,800 experiments, and evaluate their results with both quality indicators and statistical significance tests, following the most recent best practice in the literature. The results show that strategies generated by Sentinel outperform the baseline strategies in 95% of the cases, always with large effect sizes. They also obtain statistically significantly better results than state-of-the-art strategies in 88% of the cases, with large effect sizes for 95% of them. Also, our study reveals that the mutation strategies generated by Sentinel for a given software version can be used without any loss in quality for subsequently developed versions in 95% of the cases. These results show that Sentinel is able to automatically generate mutation strategies that reduce mutation testing cost without affecting its testing effectiveness (i.e., mutation score), thus relieving the tester of the burden of manually selecting and configuring strategies for each SUT.
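The multi-objective view taken above, lower execution cost against preserved mutation score, can be illustrated with a small Pareto-front filter. This is not Sentinel's hyper-heuristic, which generates strategies rather than merely ranking them; the strategy names, costs, and scores below are invented for the example.

```python
from typing import List, NamedTuple


class Strategy(NamedTuple):
    name: str
    cost: float   # relative execution time, lower is better
    score: float  # fraction of the full mutation score preserved, higher is better


def dominates(a: Strategy, b: Strategy) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.cost <= b.cost and a.score >= b.score
    strictly_better = a.cost < b.cost or a.score > b.score
    return no_worse and strictly_better


def pareto_front(strategies: List[Strategy]) -> List[Strategy]:
    """Keep only the strategies that no other strategy dominates."""
    return [s for s in strategies
            if not any(dominates(other, s) for other in strategies)]


# Hypothetical cost reduction strategies for one software under test.
CANDIDATES = [
    Strategy("all mutants", cost=1.0, score=1.00),
    Strategy("random 50%", cost=0.5, score=0.90),
    Strategy("random 10%", cost=0.1, score=0.70),
    Strategy("selective A", cost=0.4, score=0.95),
    Strategy("selective B", cost=0.6, score=0.85),
]

if __name__ == "__main__":
    for strategy in pareto_front(CANDIDATES):
        print(strategy)
```

Here "random 50%" and "selective B" drop out because "selective A" is both cheaper and more accurate; the remaining strategies represent genuine trade-offs from which a tester, or an automated search, would have to choose.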

mutation score, operator, sentinel, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TSE.2020.3002496

2103.07241

Country:

Europe > United Kingdom (0.14)
South America > Uruguay > Maldonado > Maldonado (0.04)
South America > Brazil > Paraná > Curitiba (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
